Systematic Development of Data Mining-Based Data Quality Tools
نویسندگان
چکیده
Data quality problems have been a persistent concern especially for large historically grown databases. If maintained over long periods, interpretation and usage of their schemas often shifts. Therefore, traditional data scrubbing techniques based on existing schema and integrity constraint documentation are hardly applicable. So-called data auditing environments circumvent this problem by using machine learning techniques in order to induce semantically meaningful structures from the actual data, and then classifying outliers that do not fit the induced schema as potential errors. However, as the quality of the analyzed database is a-priori unknown, the design of data auditing environments requires special methods for the calibration of error measurements based on the induced schema. In this paper, we present a data audit test generator that systematically generates and pollutes artificial benchmark databases for this purpose. The test generator has been implemented as part of a data auditing environment based on the well-known machine learning algorithm C4.5. Validation in the partial quality audit of a large service-related database at DaimlerChrysler shows the usefulness of the approach as a complement to standard data scrubbing. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment. Proceedings of the 29th VLDB Conference, Berlin, Germany, 2003
منابع مشابه
Some Notes on Critical Appraisal of Prevalence Studies; Comment on: “The Development of a Critical Appraisal Tool for Use in Systematic Reviews Addressing Questions of Prevalence”
Decisions in healthcare should be based on information obtained according to the principles of Evidence-Based Medicine (EBM). An increasing number of systematic reviews are published which summarize the results of prevalence studies. Interpretation of the results of these reviews should be accompanied by an appraisal of the methodological quality of the included data and studies. The critical a...
متن کاملIdentification of the Patient Requirements Using Lean Six Sigma and Data Mining
Lean health care is one of new managing approaches putting the patient at the core of each change. Lean construction is based on visualization for understanding and prioritizing imporvments. By using only visualization techniques, so much important information could be missed. In order to prioritize and select improvements, it’s essential to integrate new analysis tools to achieve a good unders...
متن کاملData-Driven Approaches to Improve the Quality of Clinical Processes: A Systematic Review
Background: Considering the emergence of electronic health records and their related technologies, an increasing attention is paid to data driven approaches like machine learning, data mining, and process mining. The aim of this paper was to identify and classify these approaches to enhance the quality of clinical processes. Methods: In order to determine the knowledge related to the research ...
متن کاملApplication of Rough Set Theory in Data Mining for Decision Support Systems (DSSs)
Decision support systems (DSSs) are prevalent information systems for decision making in many competitive business environments. In a DSS, decision making process is intimately related to some factors which determine the quality of information systems and their related products. Traditional approaches to data analysis usually cannot be implemented in sophisticated Companies, where managers ne...
متن کاملThe Development of a Critical Appraisal Tool for Use in Systematic Reviews: Addressing Questions of Prevalence
Background Recently there has been a significant increase in the number of systematic reviews addressing questions of prevalence. Key features of a systematic review include the creation of an a priori protocol, clear inclusion criteria, a structured and systematic search process, critical appraisal of studies, and a formal process of data extraction followed by methods to synthesize, or combin...
متن کاملAnalyzing and Investigating the Use of Electronic Payment Tools in Iran using Data Mining Techniques
In today's world, most financial transactions are carried out using done through electronic instruments and in the context of the Information Technology and Internet. Disregarding the application of new technologies at this field and sufficing to traditional ways, will result in financial loss and customer dissatisfaction. The aim of the present study is surveying and analyzing the use of elect...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2003